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Abstract 

Sparse modeling is a powerful framework for data analysis and processing. Traditionally, encoding in this 
framework is performed by solving an £i -regularized linear regression problem, commonly referred to as Lasso 
or Basis Pursuit. In this work we combine the sparsity-inducing property of the Lasso at the individual feature 
level, with the block-sparsity property of the Group Lasso, where sparse groups of features are jointly encoded, 
obtaining a sparsity pattern hierarchically structured. This results in the Hierarchical Lasso (HiLasso), which shows 
important practical advantages. We then extend this approach to the collaborative case, where a set of simultaneously 
coded signals share the same sparsity pattern at the higher (group) level, but not necessarily at the lower (inside 
the group) level, obtaining the collaborative HiLasso model (C-HiLasso). Such signals then share the same active 
groups, or classes, but not necessarily the same active set. This model is very well suited for applications such 
as source identification and separation. An efficient optimization procedure, which guarantees convergence to the 
global optimum, is developed for these new models. The underlying presentation of the framework and optimization 
approach is complemented by experimental examples and theoretical results regarding recovery guarantees. 

L Introduction and Motivation 

Sparse signal modeling has been shown to lead to numerous state-of-the-art results in signal processing, in 
addition to being very attractive at the theoretical level. The standard model assumes that a signal can be efficiently 
represented by a sparse linear combination of atoms from a given or learned dictionary. The selected atoms form 
what is usually referred to as the active set, whose cardinality is significantly smaller than the size of the dictionary 
and the dimension of the signal. 

In recent years, it has been shown that adding structural constraints to this active set has value both at the level 
of representation robustness and at the level of signal interpretation (in particular when the active set indicates 
some physical properties of the signal); see IT], ||2|, 111 and references therein. This leads to group or structured 
sparse coding, where instead of considering the atoms as singletons, the atoms are grouped, and a few groups are 
active at a time. An alternative way to add structure (and robustness) to the problem is to consider the simultaneous 
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encoding of multiple signals, requesting that they all share the same active set. This is a natural collaborative 
filtering approach to sparse coding; see, for example, H, 0, |l6l, 0, 0, lU. 

In this work we extend these approaches in a number of directions. First, we present a hierarchical sparse model, 
where not only a few (sparse) groups of atoms are active at a time, but also each group enjoys internal sparsity[^ 
At the conceptual level, this means that the signal is represented by a few groups (classes), and inside each group 
only a few members are active at a time. A simple example of this is a piece of music (numerous applications in 
genomics and image processing exist as well), where only a few instruments are active at a time (each instrument is 
a group), and the sound produced by each instrument at each instant is efficiently represented by a few atoms of the 
sub-dictionary/group corresponding to it. Thereby, this proposed hierarchical sparse coding framework permits to 
efficiently perform source identification and separation, where the individual sources (classes/groups) that generated 
the signal are identified at the same time as their representation is reconstructed (via the sparse code inside the 
group). An efficient optimization procedure, guaranteed to converge to the global optimum, is proposed to solve the 
hierarchical sparse coding problems that arise in our framework. Theoretical recovery bounds are derived, which 
guarantee that the output of the optimization algorithm is the true underlying signal. 

Next, we go one step beyond. Continuing with the above example, if we know that the same few instruments will 
be playing simultaneously during different passages of the piece, then we can assume that the active groups at each 
instant, within the same passage, will be the same. We can exploit this information by applying the new hierarchical 
sparse coding approach in a collaborative way, enforcing that the same groups will be active at all instants within 
a passage (since they are of the same instruments and then efficiently representable by the same sub-dictionaries), 
while allowing each group for each music instant to have its own unique internal sparsity pattern (depending on 
how the sound of each instrument is represented at each instant). We propose a collaborative hierarchical sparse 
coding framework following this approach, (C-HiLasso), along with an efficient optimization procedure. We then 
comment on results regarding the correct recovery of the underlying active groups. 

The proposed optimization techniques for both HiLasso and C-HiLasso is based on the Proximal Method ITOl . 
more specifically, on its particular implementation for sparse problems. Sparse Reconstruction by Separable 
Approximation (spaRSA) [fTT|. This is an iterative method which solves a subproblem at each iteration which, 
in our case, has a closed form and can be solved in linear time. Furthermore, this closed form solution combines 
a vector thresholding and a scalar thresholding, naturally yielding to the desired hierarchical sparsity patterns. 

The rest of the paper is organized as follows: Section [ill provides an introduction to traditional sparse modeUng 
and presents our proposed HiLasso and C-HiLasso models. We discuss their relationship with the recent works 
of 0, lfT2l . IIT3I . llT4l . ifTSll . lfT6l . In Section III we describe the optimization techniques applied to solve the 
resulting sparse coding problems and we discuss its relationship with other optimization methods recently proposed 
in the literature ifTSl . ifTTl . Theoretical recovery guarantees for HiLasso in the noiseless setting are developed in 

'while we consider only 2 levels of sparsity, the proposed framework is easily extended to multiple levels. 
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Section IV demonstrating improved performance when compared with Lasso and Group Lasso. We also comment 
on existing results regarding correct recovery of group-sparse patterns in the collaborative case. Experimental results 
and simulations are given in Section |V] and finally concluding remarks are presented in Section |VI| 



II. Collaborative Hierarchical Sparse Coding 

A. Background: Lasso and Group Lasso 

Assume we have a set of data samples Xj G ffi™, j = 1, . . . ,n, and a dictionary of p atoms in M™, assembled as 
a matrix D e E™^^, D = [did2 . . . dp]. Each sample can be written as = Da^ + e, aj eW, e e R"\ that 
is, as a linear combination of the atoms in the dictionary D plus some perturbation e, satisfying ||e||2 ^ llxjUj- 
The basic underlying assumption in sparse modeling is that, for all or most j, the "optimal" a.j has only a few 
nonzero elements. Formally, if we define the £o cost as the pseudo-norm counting the number of nonzero elements 
of aj, ||aj||p := |{fc : akj 7^ 0}|, then we expect that ||aj||p <C p and ||aj||p ^ m for all or most j. 

Seeking the sparsest representation a is known to be NP-hard. To determine a^ in practice, a multitude of efficient 
algorithms have been proposed, which achieve high correct recovery rates. The £1 -minimization method is the most 
extensively studied recovery technique. In this approach, the non-convex £0 norm is replaced by the convex £1 
norm, leading to 

min llalL s.t. Ilx, - Dall^ < e. (Ill) 

The use of general purpose or specialized convex optimization techniques allows for efficient reconstruction using 
this strategy. The above approximation is known as the Lasso ifTSl or Basis Pursuit lfT9ll . EOl . A popular variant is 
to use the unconstrained version 

minJl|x,-Da||^ + A||a||,, (II.2) 

where A is an appropriate parameter value, usually found by cross-validation, or based on statistical principles f2T\ . 

The fact that the || ||j^ regularizer induces sparsity in the solution a^ is desirable not only from a regularization 
point of view, but also from a model selection perspective, where one wants to identify the relevant factors (atoms) 
that conform each sample Xj. In many situations, however, the goal is to represent the relevant factors not as 
singletons but as groups of atoms. For a dictionary of p atoms, we define groups of atoms through their indices, 
G C {1, . . . Given a group G of indexes, we denote the sub-dictionary of the columns indexed by them as 
D[(3], and the corresponding set of reconstruction coefficients as a[Q]. Define Q = {Gi, . . . , G^} to be a partition 
of {1, . . . ,p}|^In order to perform model selection at the group level (relative to the partition Q), the Group Lasso 
problem was introduced in UJ, 

1 2 

min - ||x,- - Da|L + \ipg{a), (II.3) 

^While in this paper we concentrate and develop tlie important non-overlapping case, it will be clear that the concepts of collaborative 
hierarchical sparse modeling introduced here apply to the case of overlapping groups as well. 
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Fig. 1. Sparsity patterns induced by HiLasso (left) and C-HiLasso (riglit) model selection programs. Notice that the C-HiLasso imposes the 
same group-sparsity pattern in all the samples (same class), whereas the in-group spai'sity patterns can vary between samples (samples themselves 
are different). 



where tpg is the Group Lasso regularizer defined in terms of Q as V'e(^) ■= X^gsS II^[G]||2- "^^^ function ipg can 
be seen as a generalization of the £i regularizer, as the latter arises from the special case Q = {{1}, {2}, . . . , {p}} 
(the groups are singletons), and as such, its effect on the groups of a is also a natural generalization of the one 
obtained with the Lasso: it "turns on/off" atoms in groups. 

We can always consider the "noiseless" sparse coding problem miriaeRp {V'(a) • = Da}, for a generic 
regularizer ?/'(■)' the limit of the Lagrangian sparse coding problem miiiagup || W^j ^ Da||2 + A?/;(a)| when 
A — > 0. In the remainder of this section, as well as in Section III we only present the corresponding Lagrangian 
formulations. 



B. The Hierarchical Lasso 



The Group Lasso trades sparsity at the single-coefficient level with sparsity at a group level, while, inside each 
group, the solution is generally dense. Let us consider for example that each group is a sub-dictionary trained to 
efficiently represent, via sparse modeling, an instrument, a type of image, or a given class of signals in general. 
The entire dictionary D is then appropriate to represent all classes of the signal as well as mixtures of them, and 
Group Lasso will properly represent (dense) mixtures with one group or sub-dictionary per class. At the same time, 
since each class is properly represented in a sparse mode via its corresponding group or sub-dictionary, we expect 
sparsity inside its groups as well (which is not achieved by Group Lasso, whose solutions are dense inside each 
group). This will become even more critical in the collaborative case, where signals will share groups because they 
are of the same class, but will not necessarily share the full active sets, since they are not the same signal. To 
achieve the desired in-group sparsity, we simply re-introduce the £i regularizer together with the group regularizer, 
leading to the proposed Hierarchical Lasso (HiLasso) modelj^ 



1 2 

II^J' ^ °^ll2 + ^2^/'e(a) + Ai llalli 



(11.4) 



The hierarchical sparsity pattern produced by the solutions of (II.4i is depicted in Figure [TJleft). For simplicity of 
the description, we assume that all the groups have the same number of elements. The extension to the general case 

'We can similarly define a hierarchical sparsity model with instead of ii. 
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is obtained by multiplying each group norm by the square root of the corresponding group size. This model then 
achieves the desired effect of promoting sparsity at the group/class level while at the same time leading to overall 
sparse feature selection. As mentioned above, additional levels of hierarchy can be considered as well, e.g., with 
groups inside the blocks. This is relevant for example in audio analysis. 

As with models such as Lasso and Group Lasso, the optimal parameters Ai and A2 are application and data 
dependent. In some specific cases, closed form solutions exist for such parameters. For example, for signal restoration 
in the presence of noise using Lasso (A2 = 0), the GSURE method provides a simple way to compute the optimal 
Ai ETi . As extending such methods to HiLasso (or the C-HiLasso model presented next) is beyond the scope 
of this work, we rely on cross-validation for the choice of such parameters. The selection of Ai and A2 has an 
important influence on the sparsity of the obtained solution. Intuitively, as A2/A1 increases, the group constraint 
becomes dominant and the solution tends to be more sparse at a group level but less sparse within groups (see 
Figure |2|. This relation allows in practice to intuitively select a set of parameters that performs well. We also 
noticed empirically that the selection of the parameters is quite robust, since small variations in their numerical 
value don't change considerably the obtained results. 

Some recent modeling frameworks for sparse coding do not rely on the selection of such model parameters, 
e.g., following the Minimum Description Length criterion in ll22l . or non-parametric Bayesian techniques in ll23l . 
Applying such techniques to the here proposed models is subject of future research. 



C. Collaborative Hierarchical Lasso 

In numerous applications, one expects that certain collections of samples Xj share the same active components 
from the dictionary, that is, that the indices of the nonzero coefficients in are the same for all the samples 
in the collection. Imposing such dependency in the £1 regularized regression problem gives rise to the so called 
collaborative (also called "multitask" or "simultaneous") sparse coding problem ||4l, (SI, lID, ll24l . Considering the 
coefficients matrix A = [ai, . . . , a„] e MP^" associated with the reconstruction of the samples X = [xi, . . . , x„] € 
]R™^", this model is given by 

1 ^ 
mill -IIX-DAII^ + aV llaML (II.5) 

fe=i 

where a'^' e M" is the k-th row of A, that is, the vector of the n different values that the coefficient associated 
to the fc-th atom takes for each sample j — 1, ... ,n. If we now extend this idea to the Group Lasso, we obtain a 
collaborative Group Lasso (C-GLasso) formulation, 

min i||X-DA|l^ + AV^e(A), (II.6) 

where '0g(A) = X^Gee II"^'^IIf' '^'^ '■^^ sub-matrix formed by all the rows belonging to group G. This 
regularizer is the natural collaborative extension of the regularizer in pi.3[ ). 



6 



lambdal = 0.125 lambda2=0,25 




80 100 120 140 ISO 60 80 100 120 140 160 



Fig. 2. Effect of different combinations of Ai and A2 on tlie solutions of tlie HiLasso coding problem. Three cases are given in which we 
want to recover a sparse signal (red crosses) ao by means of the solution a of the HiLasso problem (blue dots). In this example we have two 
active groups out of ten possible (the sub dictionaries associated to each group have 30 atoms) and ao = 8 (four non-zero coefficient per active 
group). The estimate that is closest to ao in £1 norm is shown in the top left. As the ratio A2/A1 increases (bottom left), the level sets of 
the regularizer il>g{-) become rounder, thus encouraging denser solutions. This is depicted in the rightmost figure for a simple case of g = 1 
groups. Increasing Ai again (bottom right) increases sparsity, although here the final effect is too strong and some non-zero coefficients are not 
detected. 



In this paper, we take an additional step and treat this together with the hierarchical extension presented in the 
previous section. The combined model that we propose, C-HiLasso, is given by 

n 

min -|lX-DA||^ + A27/^e(A) + Ai V||a,|l^. (II.7) 

The sparsity pattern obtained using ( |II.7| i is shown in Figure [TJright). The C-GLasso is a particular case of our 
model when Ai = 0. On the other hand, one can obtain independent Lasso solutions for each by setting A2 = 0. 



We see that (II.7i encourages all the signals to share the same groups (classes), while the active set inside each 
group is signal dependent. We thereby obtain a collaborative hierarchical sparse model, with collaboration at the 
class level (all signals collaborate to identify the classes), and freedom at the individual levels inside the class to 
adapt to each particular signal. This new model is particularly well suited, for example, when the data vectors have 
missing components. In this case combining the information from all the samples is very important in order to 
obtain a correct representation and model (group) selection. This can be done by slightly changing the data term in 



(II. 7 1. For each data vector one computes the reconstruction error using only the observed elements. Note that 
the missing components do not affect the other terms of the equation. Examples will be shown in Section |V] 
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D. Relationship to Recent Literature 

A number of recent works have addressed hierarchy, grouping and collaboration within the sparse modeUng 
community. We now discuss the ones most closely related to the proposed HiLasso and C-HiLasso models. 

In the authors propose a general framework in which one can define a regularization term to encourage a 
variety of sparsity patterns, and provide theoretical results (different to the ones developed here) for the single- 
signal case. The HiLasso model presented here, in the single signal scenario, can be seen as a particular case of that 
model (where the groups in ||2l should be blocks and singletons), although the particularly and important case of 
hierarchical structure introduced here is not mentioned in that paper. In lfT2l the authors simultaneously (see 1251 ) 
proposed a model that coincides with ours again in the single-signal scenario. None of these approaches develop 
the collaborative framework introduced here, nor the theoretical guarantees. The recovery of mixed signals with 
optimization was addressed in |fT6ll . This model does not include block sparsity (no hierarchy), collaboration, or 
the theoretical results we obtain here. 

The special case of C-HiLasso when Ai = 0, C-GLasso, is investigated in 1*261, where a theoretical analysis of 
the signal recovery properties of the model is developed. Collaborative coding with structured sparsity has also 
been used recently in the context of gene expression analysis ifTJl . llT4l . In lfT3l . the authors propose a model, 
that can be interpreted as a particular case of the collaborative approach presented here, in which a set of signals 
is simultaneously coded using a small (sparse) number of atoms of the dictionary. They modify the classical 
collaborative sparse coding regularization so that each signal can use any subset of the detected atoms. This is 
equivalent to our model when the groups have only one element and therefore there is no hierarchy in the coding. 
A collaborative model is presented in lfT4l . where signals sharing the same active atoms are grouped together in a 
hierarchical way by means of a tree structure. The regularization term proposed is analogous to the one proposed 
in our work, but it is used to group signals rather than atoms (features), having once again no hierarchical coding. 

Tree-based sparse coding has also been used recently to learn dictionaries ifTSl . ifTTl . Under this model, if a 
particular learned atom is not used in the decomposition of a signal, then none of its descendants (in terms of 
the given tree structure) can be used. Although not explicitly considered in these works, the HiLasso model is an 
important particular case, among the wide spectrum of hierarchical sparse models considered in this line of work, 
where the hierarchy has two levels and no single atoms are in the upper level. 

To conclude, while particular instances of the proposed C-HiLasso have been recently reported in the literature, 
none of them are as comprehensive. C-HiLasso includes both collaboration, at a block/group level, and hierarchical 
coding. Such collaborative hierarchical structure is novel and fundamental to address new important problems such 
as collaborative source identification and separation. The new theoretical results presented here extend the block 
sparsity results of O, ll27l . complementing the modehng and algorithmic work. 
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III. Optimization 



A. Single-Signal Problem: HiLasso 



In the last decade, optimization of problems of the form of ( |II.2| i and ( |II.3| ) have been deeply studied, and there 
exist very efficient algorithms for solving them. Recently, Wright et. al ifTTl proposed a framework, SpaRSA, for 
solving the general problem 

min/(a) + AV'(a). (III.8) 

aGRf 

be a smooth and convex function, while i/; : ^ M only needs to be finite and convex in W. This formulation, 
which is a particular case of the Proximal Method framework developed by Nesterov |10|j, includes as important 
particular cases the Lasso, Group-Lasso and HiLasso problems by setting /(•) as the reconstruction error and then 
choosing the corresponding regularizers for V'(-). When the regularizer, is group separable, the optimization 
can be subdivided into smaller problems, one per group. The framework becomes powerful when these sub-problems 
can be solved efficiently. This is the case of the Lasso and Group Lasso (with non overlapping groups) settings, 
and also of the HiLasso, as we will show later in this Section. In all cases, the solution of the sub-problems are 
obtained in linear time. 

The SpaRSA algorithm generates a sequence of iterates {a^'^jtgN that, under certain conditions, converges to the 
solution of ( |III.8| l. At each iteration, a(*+^) is obtained by solving 



min (z-a^'O^V/(a^*0 



v(t) 



(111.9) 



for a sequence of parameters {a^*^}jgNi ot^^'^ — a^rf, where ao > and 77 > 1 need to be chosen properly for the 
algorithm to converge (see ifTTI for details). It is easy to show that (III.9l is equivalent to 

-^^(z), (IILIO) 



. 1 

mm - 

z6RP 2 



(*) 



where u*^*^ = a*^*'' — ^^V/(a*^*^). In this new formulation, it is clear that the first term in the cost function 
can be separated element-wise. Thus, when the regularization function is group separable, so is the overall 



optimization, and one can solve (III. 10 1 independently for each group, leading to 



,(t+i) _ 
\g] - 



. 1 

arg mm - 

zGRlGI 2 



z — u 



A 



Z[|3] being the corresponding variable for the group. In the case of HiLasso, this becomes, 

(t + l) • 1 II ||2 , Al 

= arg mm - ||z - ^\\^ + — - ||z||2 + ||z||^ , 
' ' zGKiGi 2 ay^> ay^i 

is a second order cone program (SOCP), for which one could 



(IILll) 



where we have defined w — Problem dlll.ll 



[G]- 



use generic solvers. However, since it needs to be solved many times within the SpaRSA iterations, it is crucial to 



solve it efficiently. It turns out that ( III. 1 1 1 admits a closed form solution with cost linear in the dimension of w. 



By inspecting the subgradient of (III. 11 1 for the case where the optimum z* 7^ 0, 

^ A2 ^ 



v^r - 1 



z* e Ai9|lz* 



1 ' 
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Input: Data X, dictionary D, group set Q, constants ao > 0, »; > 1, c > 0, < Omin < ctmax 
Output: Tlie optimal point a* 
InitiaUze t := 0, a(") := 0; 
while stopping criterion is not satisfied do 
choose aW e [amin, ctmax]; 
set u(*) := a(*) - ^V/(aW); 
while stopping criterion is not satisfied do 

// Here we use the group separability of jlll. 10) and solve jlll.l 1) for each group 
for i := 1 to q Ao 

Compute aj^^^' as the solution to III. 13 ; 

end 

set := 

end 

set t := t + 1 ; 

end 

Algorithm 1: HiLasso optimization algorithm. 

where we have defined A2 = \2/cS^^ and Ai = Ai/a'^*^. If we now define C(z*) = 1 + A2/ ||z*||2^ we observe 
that each element of C(z*)z* is the solution of the well known scalar soft thresholding operator, 

sgn(w^)max{0, \wi\ - Ai} ^ , i^l,...,g, (III.12) 



* C(z*) ^ " ' " C{z*) ' 

where we have defined hi = sgn{wi) niax{0, \wi \ — Ai}, the result of the scalar thresholding of w. Taking squares 



on both sides of (III.12i and summing over i = 1, . . . , g we obtain 

IK||^^c^(z*)Hh||^= '1^*5 iihii^ 

from which the equality ||z*||2 = Ilh*|l2 — A2 follows. Since all terms are positive, this can only hold as long as 
||h*||, > A2, which gives us a vector thresholding condition on the solution z* in terms of ||h||2. It is easy to show 
that II h* II 2 < A2 is a sufficient condition for z* = 0. Thus we obtain, 

r max{0||h||,-A.} ^ ||h||2>0 

^ I " (III.I3) 

" 1 , ||h||2-0. 

The above expression requires g scalar thresholding operations, and one vector thresholding, which is also linear 



with respecto to the group size g. Therefore, for all groups, the cost of solving the subproblem (III. 11 1 is linear 
in m, the same as for Lasso and Group Lasso. The complete HiLasso optimization algorithm is summarized in 
Algorithm [T] The parameter 77 has very little influence in the overall performance (see ifTTl for details); we used 
?7 2 in all our experiments. Note that, as expected, the solution to the sub-problem for the cases A2 = or Ai = 0, 
corresponds respectively to scalar soft thresholding and vector soft thresholding. In particular, when A2 = 0, the 
proposed optimization reduces to the Iterative Soft Thresholding algorithm 
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B. Optimization of the Collaborative HiLasso 



The multi-signal (collaborative) case is equivalent to the one-dimensional case where the signal is a concatenation 
of the columns of X, and the dictionary is an nmxnp block-diagonal matrix, where each of the n blocks is a copy 
of the original dictionary D. However, in practice, it is not needed to build such (possibly very large) dictionary, 
and we can operate directly with the matrices D and X to find A. If we define the matrix U'-*^ G M™^" whose 
i-th column is given by u^*"* = a^'' ^jy V/(a|*''), we get the following SpaRSA iterates. 



arg mm - 

ZGR"X" 2 



u(*) 



v(t) 



"Jill ' 



J=l 



which again is group separable, so that it can be solved as q independent problems in the corresponding bands of 



(A(*+i))f 



. 1 

arg mm - 

zeRsx" 2 



(U(*))G 



El 

j=i 



''Jill ■ 



The correspondent closed form solutions for these subproblems, which are obtained in an analogous way to ( |III.12| l- 



(III. 13 1, are given by 



max{0, II H| 

^^(«+i)^G ^ ) ml 



-As} 



H 



and we have defined W 





(U(*))G. 



|H|1^>0 
|H||^ = 



h^j = sgn{wij) max{0, | 



Ai}, 



(III. 14) 



As mentioned in Section n-D| ifTTl addresses a wide spectrum of hierarchical sparse models for coding and 
dictionary learning. They propose a proximal method optimization procedure that, when restricted to the formulation 



of HiLasso, is very similar to the one developed in Section III-A The main difference with our method is that 



they solve the sub-problem ( |III.10[ ) using a dual approach (based on conic duality) that finds the exact solution in 
a finite number of operations. Our method, being tailored to the specific case of HiLasso, provides such solution 
in closed form, requiring just two thresholdings, both linear in the dimension of X, nxm. 



IV. Theoretical guarantees 

In our current theoretical analysis, we study the case of a single measurement vector (signal) x (we comment on 
the collaborative case at the end of this section), and assume that there is no measurement noise or perturbation, so 
that X = Da. Without loss of generality, we further assume that the cardinality \Gr\ = g,r = 1, . . . ,q, that is, all 
groups in Q have the same size. The goal is to recover the code a, from the observed x, by solving the noise-free 
HiLasso problem: 

min {AV-cfa) + (1 - A) ||a||, s.t. x = Da} . (IV.15) 

Note that we have replaced the two regularization parameters Ai and A2 by a single parameter A, since scaUng 
does not effect the optimal solution. Therefore, we can always assume that Ai + A2 = 1. 
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Our goal is to develop conditions under which the HiLasso program of (IV. 15i will recover the true unknown 
vector a. As we will see, the resulting set of recoverable signals is a superset of those recoverable by Lasso, that 
is, HiLasso is able to recover signals for which Lasso (or Group Lasso) will fail to do so. 

We assume throughout this section that a has group sparsity k, namely, no more than k of the group vectors 
^[Gi]i i — ^, ■ ■ ■ ,<1, have non-zero norm. In addition, within each group, we assume that not more than s elements 
are non zero, that is, ||a[(3]||o < s. 



For A = 1, (IV.15i reduces to the Group Lasso problem, (II. 3 1, whereas with A = 0, (IV.lSl becomes equivalent 



to the Lasso problem, (II.2i. Both cases have been treated previously in the literature and sufficient conditions 
have been derived on the sparsity levels and on the dictionary D to ensure that the resulting optimization problem 
recovers the true unknown vector a. For example, in ||3], E9l . Il30l . conditions are given in terms of the restricted 
isometry property (RIP) of D. In an alternative line of work, recovery conditions are based on coherence measures, 
which are easier to compute IZTl . ||3T1 . Here, we follow the same spirit and consider coherence bounds that 
ensure recovery using the HiLasso approach. We also draw from |9| to briefly describe conditions under which 
the probability of error of recovering the correct groups, using the special case of the C-HiLasso with Ai = 
(C-GLasso), falls exponentially to as the number of collaborating samples n grows. Finally, recent theoretical 
results on block sparsity were reported in lf32l . In particular, bounds on the number of measurements required for 
block sparse recovery were developed under the assumption that the measurement matrix D has a basis of the 
null-space distributed uniformly in the Grassmanian. The model is a block-sparse model, without hierarchical or 
collaborative components. 

In this section we extend the group-wise indexing notation to refer both to subsets of rows and columns of 
arbitrary matrices as Wj^gj :— {wij : i E F,j E G}. This is, Wj^^g] = ij'pjWJjQ], where I and J are the 
identity matrices of the column and row spaces of W respectively. We define the sets ft = {1,2, ... ,p} and 
r — {1,2, .. . ,17}, and use S to denote the complement of a set of indices S, either with respect to ft or F, 
depending on the context. The set difference between S and T is denoted as S\T, represents the empty set, and 
jiS*! denotes the cardinality of S. 



A. Block-Sparse Coherence Measures 

We begin by reviewing previously proposed coherence measures. For a given dictionary D, the (standard) 
coherence is defined as ji := max^ j^igr |djdj|. This coherence was extended to the block-sparse setting in 
II27I . leading to the definition of block coherence: 

1/2 

where p(-) is the spectral norm, that is, p{Z) := Amax(Z'''Z), with Aniax(W) denoting the largest eigenvalue of 
the positive semi-definite matrix W. An alternate atom-wise measure of block coherence is given by the cross 
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coherence, 

X :=max{max{|dfdj|, i £ G, j e F) G,F e G ,G ^ F) . (IV.16) 

When g — \ (each block is a singleton), Djg^j = d,., so that as expected, x = Ms = M- While and x quantify 
global properties of the dictionary D, local block properties are characterized by the sub-coherence, defined as 

z/ max{max{|dfdj |, i,j e G,i 7^ G e G] . (IV.17) 

We define v = Q for g = 1. Clearly, if the columns of Dj^] are orthonormal for each group G, then v = {). 
Assuming the columns of D have unit norm, it can be easily shown that ^, v, x and /i^ all lie in the range 
[0, 1]. In addition, we can easily prove that v, ^b, X ^ I" our setting, a is block sparse, but has further internal 
structure: each sub-vector of a is also sparse. In order to quantify our ability to recover such signals, we expect 
that an appropriate coherence measure will be based on the definition of block sparsity, but will further incorporate 
the internal sparsity as well. Let M D^D denote the Gram matrix of D. Then, the standard block coherence 
/is is defined in terms of the largest singular value of an off-diagonal gxg sub-block of M. In a similar fashion, 
we will define sparse block coherence measures in terms of sparse singular values. As we will see, two different 
definitions will play a role, depending on where exactly the sparsity within the block enters. To define these, we 
note that the spectral norm p(Z) of a matrix Z can be defined as 

p(Z) := max|uTZv| s.t. |lu||2 = 1, |lv||2 = 1. 



Alternatively, we can define p(Z) as the largest singular value of Z, p{Z) := crmax(Z) = ^AmaxlZTZ), 

Amax(Z'rZ) := maxvT(ZTZ)v s.t. IMIj = 1. 

We now develop sparse analogs of p(Z) and Amax(Z^Z). As we will see, the simple square-root relation no longer 
holds in this case. The largest sparse singular value is defined as 1331 : 

p^^(Z) :=max|uTZv| s.t. Hujl^ - 1, ||v||2 - 1, ||u|lo < ,s, |lv|lo < s. (IV.18) 

Similarly, the largest sparse eigenvalue of Z^Z is defined as ll33l . Il34l . ||35]| . 

Aj^^JZTZ) :=maxvT(ZTZ)v s.t. ||v||2 = 1, ||v||o < s. (IV.19) 

The sparse matrix norm is then given by 

p^(Z) := v/A^,JZTZ). (IV.20) 

Note that, in general, p''(Z) is not equal to p''^{Z). It is easy to see that jO''*(Z) < p^iZ). For any matrix Z, 
p''*(Z) = p{Z[p^G]) and p^(Z) = p{Z[T]), where F,G,T are subsets of F = {1,2, ... ,g} of size s, chosen 



to maximize the corresponding singular value. Using (IV. 18i and (IV.20 1, we define two sparse block coherence 
measures: 

Pb'' ■■= max|%^^(DT^,D[f.]), G,^^eg,G^F|, (IV.21) 
I^b' ■■= maxS^^p%Bj^p[F]),G,FeG,G^FY (IV.22) 
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The choice of scaUng is to ensure that < fJ-B- 



Note that, while p^(Z) (also referred to in the literature as sparse principal component analysis (SPCA)) and 
p^^CZi) are in general NP-hard to compute, in many cases they can be computed exactly, or approximated, using 
convex programming techniques 



The following proposition establishes some relations between these new definitions and the standard coherence 
measures. 



Proposition 1. The sparse block-coherence measures ij-b^'^ , I^-b'^ satisfy 

< Ais"' < -M, < ^1B' < \ f^ii- 

9 \l 9 



(IV.23) 



Proof: The inequalities hb^" , IJ^b" > follow immediately from the definition. We obtain the upper bounds by 
rewriting p''^(Z) and p*(Z) and then using the Gersgorin Theorem, 



(a) 



'(Z) - Q.Z^p^G]) < 



[F,G]''[F.G\ 
(b) 



\ 



niaxy^|e/^^| < /smax|e/r|, 



(Z) = Ai/2^(ZT^]Z[T]) < ^ max^ |eJJ < ^smax|e; 



(IV.24) 



(IV.25) 



where e;^ and ej,. are the elements of E = Zj^pjZ^j^ r] and E' = Z^Z, and (a), (6) are a consequence of Gersgorin's 
disc theorem. 

The entries of Z = DJ^, jDj^^j for i ^ j have absolute value smaller than or equal to fi, and the size of Z is 
g X g. Therefore, \eke\ < sfjL^ and je'^^j < g/i^. Substituting these values into ( IV.24 1 and (IV.25 i concludes the 
proof of the upper bounds on ^b''^ and ^b'' ■ B 



B. Recovery Proof 

Our main recovery result is stated as follows. Suppose that a is a block A:-sparse vector with blocks of length 
g, where each block has sparsity exactly sj^and let x = Da. We rearrange the columns in D and the coefficients 
in a so that the first k groups, {Gi,G2, . . . ,Gk} correspond to the non-zero (active) blocks. Within each block 
Gi,i < k, the first s indices, represented by the set Si, correspond to the s nonzero coefficients in that block, and 
the index set = Gi \ Si represents its [g — s) inactive elements, so that = [Si T^]. The set Gq = Ui=i Gi 
contains the indices of all the active blocks of a, whereas Go = Q, \ G contains the inactive ones. Similarly, 
5o = Ui=i Si contains the indices of all the active coefficients/atoms in a and D respectively, Sq — VL \ Sq 
indexes the inactive coefficients/atoms in a/D, and Tq = UiLi Ti indexes the inactive coefficients/atoms within the 

^These conditions are non-limiting, since we can always complete the vector with zeros. 
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Fig. 3. Left: Indexing conventions, liere sliown for g = 8, k = 2 and s = 3. Shaded regions correspond to active elements/atoms. Active 
blocks are light-colored, and active elements/coefficients are dark colored. Here a' represents an alternate representation of x, x = Da'. Blocks 
and atoms that are not part of the true solution a are marked in red. Right: partitioning of a matrix W performed by the measure p[e,j='] (W) 
with £ = {Ei,E2} and T = {^1,^2,^3}, where \Ei\ = s and = g. 



active blocks. These indexing conventions are exemplified in Figure [3jleft). With these conventions we can write 

X = D[Go]a[Go] = D[So]a[So]- 

An important assumption that we will rely on throughout, is that the columns of Djc^j must be linearly 
independent for any Go as defined above. Under this assumption, Djg^jDjg^] is invertible and we can define the 
pseudo-inverse H := (Djj^jD[5g])~^Djg^j. For reasons that will become clear later, we will also need a second, 
oblique pseudo-inverse, Q :— (Dj'^^j(I — P)D[5p])~^Djj^j (I — P), where P is an orthogonal projection onto the 
range of D[ro]' th^t PDjj'i^] = D[To]- It is easy to check that 

QD[To] = 0, and QD[s„] - I. (IV.26) 

Equipped with these definitions we can now state our main result. 

Theorem 1. Let a be a block fc-sparse vector with blocks of length g, where each block has sparsity s. Let x = Da 



for a given matrix D. A sufficient condition for the HiLasso algorithm (IV.15 1 to recover a from x is that, for some 
a < 1, 



^[5„,eo](QD[c-]) < a, (IV.27) 

A(l - a) 

|HD[To]||i,i < 1. (IV.29) 



HD[g^]||i,i < 7, 7<l + ^i^^ ^, (IV.28) 



Here /9[£ jf](Z) := maxpfzjrJ^Ees Pi'^lE,F]), is the block spectral norm defined in ll27l . the blocks defined 
by the sets of index sets £ and T (see Figure [sjright)). We also define Sq ~ {Si : i = 1, . . . , k}, Qq = 
{Gi : i ~ k + 1, . . . ,q} and % ~ {Ti : i = 1, . . . , k}. Finally, |jZ||j^ ^ := max^ Hz^Hj^, where is the r-th column 
of Z. 
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The above theorem can be interpreted as follows. With 7 = 1, the conditions (IV.28 l-(IV.29 1 are sufficient both 



for Lasso (A — 0) and HiLasso to recover a. However, if there exists a 7 > 1 for which condition ( IV.28 1 holds, 
then HiLasso will be able to recover a in a situation where Lasso is not guaranteed to do so. The idea is that, for 
< A < 1, HiLasso trades off between the minimization of its £1 and £2 terms, by tightening the £2 term (a < 1) 
to improve group recovery, while loosening the £1 term (7 > 1). Also, although not yet clear from conditions 
( |IV.27[ )-( [lV.29| l, we will see in Theorem |2] that the final data independent bounds are also a relaxation of the ones 
corresponding to Group Lasso when the solutions are block-dense. Therefore, the proposed model outperforms 
both standard Lasso and Group Lasso with regard to recovery guarantees. This is also reflected in the experimental 
results presented in the next section. 



The sufficient conditions (IV.27 i-(IV.29 1 depend on D[5^] and therefore on the nonzero blocks in a. Go, and 



the nonzero locations within the blocks, Sq, which, of course, are not known in advance. Nonetheless, Theorem |2] 



provides sufficient conditions ensuring that (IV.27 i-(IV.29 1 hold, which are independent of the unknown signals, 
and depend only on the dictionary D. 

We now prove Theorem [T| 

Proof: To prove that ( IV.15[ ) recovers the correct vector a, let a' be an alternative solution satisfying x = Da'. We 
will show that \tpg{a.) + (1 — A) |la|l-^ < AV'e(s') + (1 ^ ll^'lli- Let the set Go contain the indices of all elements 
in the active blocks of a. Let Gq contain the indices of the active blocks in a'. Then x — DjCpjajG^] = Djgj^jajg,, j. 

By our assumptions, in each block of sl[Go] there are exactly s nonzero values. Let the set Sq C Go contain the 



indices of all nonzero elements in a. We thus have |5o| = ks. Using (IV.26l we can write 

a[So] = QD[So]a[So] = QD[Go]a[Go] = QD[G;,]a'[g, j . 



(IV.30) 



To proceed, we separate G'q into two parts: B = Gq D Gq, and B = G'q \ Gq, so that Gg = [B B] and 
D[Gj^]ajQ,j = 'D[b]^[b] + ^[Bl^js]' rewrite (IV.30i as 



a[So] = QD[B]a{s] + QDj^jaj-gj, 

and use the triangle inequality to obtain 

V-elafSo]) < V-elQCiBiajs]) +i/;g(QD[gja[5j). 
We now analyze the two terms in the right hand side of (IV.32i using ll27l Lemma 3]: 



(IV.31) 



(IV.32) 



Lemma 1. Let v e be a vector, Z e M'"^^' be a matrix, J" be a partition of = {1, 2, . . . ,p}, and £ a partition 
of {!,... ,m}. We then have that, V'e(Zv) < p[£,jF](Z)V'g(v)|^ 

^Note that the statement of Lemma[T]as shown here is actually a slight generalization of |27 Lemma 3], where the groups in the partitions 
need not have the same size. 
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Since B C Go, it follows from (IV.27i that pj^^^ gj(QDj^j) < a (here B is the set of the blocks that comprise 



B). To analyze /3[5o,e](QD[B]), we use its definition, 

P[5o,6](QD[s])=max ^ p((QD)[s,;.])=max ^ pHQB^s^^F]), (IV.33) 

EeSa Sj:j = l,...,k 

and analyze each of its terms. By definition of B, each F ^ B corresponds to some Gi ~ [Si Ti] for some i < k. 
We can thus write (QD)[5^. j?] = [(QD)[5^_5.] (QD)[5^. y.] ]. Then, by recalling that QD[7-g] = we see that 
(QD)[s^.,T.] = for all Now, when i ^ j we have (QD)[s^. s.j = I, thus p{{Q'D)[Sj,F]) = /o([I 0]) = 1. 



When i ^ j, (QD)[5^._s.] = 0, and p{{Q'D)[Sj,F]) = p{[0 0]) = in that case. From (IV.33 i we conclude that 



P[5o,e](QD[B]) = 1- Plugging into ( |IV.32| leads to 

"05 (a) < ■0e(a'i3]) + aV-elaj^]). 



(IV.34) 



For the £i term, we follow the same path as ( IV.30| l and (IV. 31 1, now using the Moore-Penrose pseudo-inverse H 



HDjgjaj-g., from which ||a||j^ < 



HDrBia 



a', 



follows. 



HDr 



instead, yielding a[s„] = HD[B]a{^j 

Using the fact that ||V^v||, ^ < ||W||, ^ ||v||, M, we get ||a||, < ||HD[b]||^^^ 

Now, since B C Go, and ||HD[Qg]||j^ = 1, we have that ||HD[b]||j i — Together with condition (IV.29i this 
yields. 



1.1 



+ 7 



(IV.35) 



Combining ( IV.34 1 and (IV.35 1 into the HiLasso cost function we get 

A^e(a) + (1-A)||a||i < xUg{a.[B]) + a^g{a.[-g^)] + {1 - X) 



(IV.36) 



Now, to finish the proof, we need to bound the right hand side of (IV.36 1 by Xipgl^a') + (1 — A) ||a'||j^, in order to 
show that the alternate a' is not a minimum of the HiLasso problem. For any 7 satisfying 

A(l - a)iPg{a' ) 



7 < 1 + 



we have. 



A[Ve(ajBj) + a^e(a;^,)] + (l-A)[ 



[B]^ 



where we have used the fact that ||a'||j^ = 



(1-A) 



+ 7 



[B] 



[B] 



<AV^e(a') + (l-A)||a'||,, 



(IV.37) 



and -0e(a') = i>g{a.'^g^) + ipg{a'^-^^). To obtain a 



signal independent relationship between 7 and a, we bound tpg (aj^j ) in terms of 



IB] 



[B] 



resulting in the condition 



which completes the proof. 



1 - 

i i 

A(l - a) 



[B]' 



7 < 1 



(1-A)V5 



< 1 



A(l -a)V'e(a{g]) 



(1-A) 
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We conclude that we can guarantee recovery for every choice of A as long as ( IV.27| i-(IV.29 1 are satisfied. Note 



that when A = (Lasso mode) we get 7 < 1, and, as expected, (IV.28 l-(IV.29 1 reduce to the Lasso recovery 
condition. Also, if a = 1 we have 7 < 1, meaning that we must tighten the constrains related to the £2 part of 
the cost function in order to relax the £1 part. For 7 > 1, the HiLasso conditions are a relaxation of the Lasso 
conditions, thus allowing for more signals to be correctly recovered. 

Theorem |2] below provides signal independent replacements of the conditions ( |IV.27| i-( [lV.29| l. The signal 



independent bound for (IV.27i derived here, depends on coherence measures between the dictionary D and its 
image under the projection I — P, C = (I — P)D. Since P depends on Sq, C itself is signal dependent. Thus, we 
need to maximize also over all possible sets Sq. These are defined as follows. 



Vp 



max < max < max • 



(dIc,)V2(dJc,)i 



/2 



i,jeG,t^j},Geg},So 



max|-p^(DT-g]C[j.]), G, F e G ,G ^ F ^ , So 
max|%^^(DT.]C[j.]), G, F e g, G ^ f| , ^0 



C max|max{(djcj) : z = 1, . . . ,p}, Sq^ . 
We are now in position to state the theorem. 



(IV.38) 

(IV.39) 
(IV.40) 
(IV.41) 



Theorem 2. Let ^p^ Mp' Mp ™d C, be the coherence measures defined respectively in (IV.16 1 and (IV.38 1-( IV.41 
Then the conditions ( |IV.27| l-( (lV.29| l are satisfied if 

C^^SMp 



ksx 

1 - (s - l)v -(k- l)sx 
ksv 



< a, 

< 7, 

< L 



(IV.42) 
(IV.43) 
(IV.44) 



1 - (s - - (fc - l)sx 

We also require the denominators in ( |IV.42[ )-( [lV.44| l to be positive. Note that, although the interpretation of ( |IV.42[ ) 
is rather counter-intuitive, it is easy to check that /ip, /ip < fiB- This can be seen when s = g (a case included in 



our theorems), in which case P = 0, C = D, and /ip = /ip = /ip. Therefore, the condition (IV.42 1 is a relaxation 
of the standard (dense) block-sparse recovery one ll27l Theorem 2]. 



Proof: Recall that QD,^, 



[So] ^ [Sol 



^[So]^[Go]- Since P[., 



is submultiplicative. 



Applying the definitions of p^g^ j and /ip we have. 



FeGo 



EeSo 



(IV.45) 



(IV.46) 



^There is a slight abuse of notation here, in that, in our case of non-square blocks, each norm p[. .] (■) in the right hand of the submultiplicativity 
inequality jIV.45| is actually a different norm. However, it is easy to see that the referred inequality holds in this case as well. 
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where the last inequality in (IV.46i derives from (IV.39i and the fact that each E E Sq belongs to some Gi, and 
\E\ ~ s, thus playing the role of the set T in the definition of p'^{-). Our goal is now to obtain a bound for 
P[5o,5o]((D[5jj]C[So])"^). To this end, we define Z = Dj'"^^jC[5p], and rewrite it as Z A"i(I - (I - AZA))A"^ 
Here A is a ksxks block-diagonal scaling matrix to be defined later. Assume for now that P[So,So]i^^ AZA) < 1. 
This allows us to apply the following result from fTl]: 

Lemma 2. Suppose that P[£,^](W) < 1. Then (I + W)-i = Er=o(-W)'=. 

By applying Lemma|2]to W = -I + AZA we can write Z^^ = A [ESo(I ^ AZA)'] A. With this, 

P[5o,5o](Z-^) < [p[5o.5ol(A)]%[5o,5o] E(I-^ZA)M < A,] (A)] E /'[^o.^o] ((I " AZA)') 

\i=0 ) 1=0 

< [P[....o](A)]^E(P[^oA,l(I-AZA))^ 'I I-pX^^^AZA) ' ^'""-''^ 

where in (a) and (c) we applied the submultiplicativity of p[. .](•), (6) is a consequence if the triangle inequality, 
and (d) is the limit of the geometric series, which is finite when /0[5„,5(,](I — AZA) < 1. 

We now bound — AZA). First, note that, since A is block-diagonal, we have that (I — AZA) [5. 5^] = 

I[Si,Sj] ~ A[g. 5.]Z[5._5^]A[5^. 5^.]. We then choose A to be a diagonal matrix with A^^ = (djcj)^^/^,i e 5*0. With 
this choice, we have that the diagonal elements of I[Sj.Sj] ^ ^lSj.Sj]'^lSj,Sj]-^lSj.Sj] equal to 1 for all j, and 
the off-diagonal elements are bounded by vp. Using Gersgorin Theorem we then have that 

P(l[s,,s,] - A[s^.,s,]Z[s,.,s,]A[s,.,5,]) <{s~ l)iyp, j = 1, . . . , fc. (IV.48) 

As for the off-diagonal sxs blocks of I — AZA, we have (I — AZA)[5. 5^.] = — A^^. s.jZjg. g^.jAjg^. 5^.]. We then 
have 

(a) (6) 

p((I-AZA)[s,^5^]) < piA[s„s^])p{Zis„s,])p(.Ms,,s,]) < Cidt^pK, (IV.49) 

where in (a) we used the submultiplicativity of p(-), and (b) derives from the definition of /ip, and the fact that, 
with our choice of A we have p(A[5. ^.j) < ( for all i. Now we can write the definition of — AZA) and 

bound its summation using ( |IV.48| l-( [lV.49| l, 

P[5o,5o](I - AZA) < max^ - A^s„s,]^[s,,s,]^ls„s,]) +■■■ 

... E P(A[s.s.](I-AZA)[5^,s^]A[s^.^s,]) i < (s-l)i^P+5A*?fC'- (IV.50) 

Si:i<kA^j J 

By our choice of A, /9(A[5. 5.]) < C and p{A^Si,Sj]) = for i 7^ j. Therefore (A) < ( as well. Using this 

together with ( |IV.50| ) in ( |IV.47| l, we obtain 
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To ensure that P[5o.5q](I — AZA) < 1, we need the denominator in the above equation to be positive. Now ( |IV.42[ ) 
follows by plugging ( |IV.46| l and ( |IV.51| l into ( |IV.45| l, 



/'[5o,eo](Q^ 



[Go 



< 



1^ - 1 - (s - l)vp + gn'-p'ik - 1)C2 



Finally, we use the same ideas to bound ||HDj^^||i i and derive (IV.43 i. Specifically, 



|HD[<5^j||i,i < ||(Df,^,D[s„])-^||i.i||Df^,,jD[^ 



11,1- 



Now 



(a) 



l(D[5o])'D[G^]lli.i = max^ Idjd,! < ksx 



jeGo 



(IV.52) 



(IV.53) 



where (a) follows from the definition of x ^nd the fact that |5o| = ks. It remains to develop a bound on 
||(Dj"_g^jD[5g])^-'-||i 1. To this end we express Dj^^jDj^^j = I + W, and bound 



IWIh 



r<k ieSr 



max <^ J2 Idld,l + Idld,l \\ < i'^ + -^k - l)x- 



(IV.54) 



since for all Sr,r < k, and all i e 5^, the first sum has (s — 1) nonzero elements bounded by i^, and the second 
sum has s{k — 1) elements bounded by x- Now, by requiring (s — + s{k — l)x < 1 we can apply Lemma |2] to 
W and follow the same path as the one that leads to (IV.50i, now using the matrix norm properties of || ||j^ j^, to 
obtain. 



fD7, ,D 



[So]^lSo] 



< 



1 



(IV.55) 



1- (s- l)z/ + s(fc- l)x 

Again, (s — + s{k — l)x < 1 is implicit in the requirement that the above denominator be positive. Plugging 
( |rv35| ) and ^V53\ into (|IV52]i yields ^VA3\ . 



The proof for (IV.44i is analogous to that of (IV.43 i, only that now the upper bound on |djdj|,z e So^j G Tq, 



is ly < 11. Continuing as before leads to (IV.44i. 



Theorems [T] and |2] are for the non-collaborative case. For the collaborative case there exist results that show that 
both the C-Lasso ||9l and C-GLasso ll26l will recover the true shared active set with a probability of error that 
vanishes exponentially with n. Since the in-group active sets are not necessarily equal for all samples in X, C- 
HiLasso could only help in recovering the group sparsity pattern. Since the C-GLasso is a special case of C-HiLasso 
when Ai = 0, we can conjecture that when Ai > 0, the accuracy of the C-HiLasso in recovering the correct groups 
will improve with larger n. Furthermore, since our results for HiLasso improve on those of the Group Lasso, it is 
to be expected that the accuracy of C-HiLasso, for an appropriate Ai > 0, will be better than that of C-GLasso. 

As an intuitive explanation to why this may happen, the proofs in f9] and flE) assume a continuous probability 
distribution on the non-zero coefficients of the signals, and give recovery results for the average case. On the other 
hand, the in-group sparsity assumption of C-HiLasso implies that only s out of g samples will be nonzero within 
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each group. This implies that, for the same group sparsity pattern, there will be much less (exactly a fraction s/g) 
non-zero elements in the possible signals compared to the ones that can occur under the hypothesis of C-GLasso. 
Since any assumed distribution of the signals under the in-group sparsity hypothesis has to be concentrated on this 
much smaller set of possible signals, they should be easier to recover correctly from solutions to the C-HiLasso 
program, compared to the dense group case of C-GLasso. 

V. Experimental results 

In this section we show the strength of the proposed HiLasso and C-HiLasso models. We start by comparing our 
model with the standard Lasso and Group Lasso using synthetic data. We created q dictionaries, D,., r — 1, . . . ,q, 
with g = 64 atoms of dimension m = 64, and i.i.d. Gaussian entries. The columns were normalized to have unit 
£2 norm. We then randomly chose k = 2 groups to be active at each time (on all the signals). Sets of n = 200 
normalized testing signals were generated, one per active group, as linear combinations of s <C 64 elements of 
the active dictionaries, = D^aJ. The mixtures were created by summing these signals and (eventually) adding 
Gaussian noise of standard deviation a. The generated testing signals have a hierarchical sparsity structure and 
while they share groups, they do not necessarily share the sparsity pattern inside the groups. We then built a single 
dictionary by concatenating the sub-dictionaries, D = [Di, . . . , D^], and used it to solve the Lasso, Group Lasso, 
HiLasso and C-HiLasso problems. Table |l] summarizes the Mean Squared Error (mse) and Hamming distance of 
the recovered coefficient vectors aj,j ~ 1, . . . ,n. We observe that our model is able to exploit the hierarchical 
structure of the data as well as the collaborative structure. Group Lasso selects in general the correct blocks but it 
does not give a sparse solution within them. On the other hand. Lasso gives a solution that has nonzero elements 
belonging to groups that were not active in the original signal, leading to a wrong model/class selection. HiLasso 
gives a sparse solution that picks atoms from the correct groups but still presents some minor mistakes. For the 
collaborative case, in all the tested configurations, no coefficients were selected outside the correct active groups, 
and the recovered coefficients are consistently the best ones. 

In all the examples, and for each method, the regularization parameters were the ones for which the best results 
were obtained. One can scale the parameter A2 to account for different number of signals. This situation is analogous 
to a change in the size of the dictionary, thus, A2 should be proportional to the square root of the number of signals 
to code. 

We then experimented with the USPS digits dataset, which has been shown to be well represented in the sparse 
modeling framework ll36l . Here the signals are vectors containing the unwrapped gray intensities of 16 x 16 
images (m — 256). We obtained each of the n = 200 samples in the testing data set as the mixture of two 
randomly chosen digits, one from each of the two drawn sets of digits. In this case we only have ground truth 
at the group level. We measure the recovery performance in terms of the average MSE of the recovered signals, 
AMSE = ^ J2r=i Sj^i ll^j ~ ^j\\2' ^^^^^ component corresponding to source r in the signal j, and 
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(T = 0.1 


417 / 22.0 1173 / 361.6 
330 / 19.8 163 / 13.3 


s = 8 


388 / 22.0 1184 / 318.2 
272 / 19.5 96 / 16.2 


g = 4 


1080 / 27.8 1916 / 221.7 
1009 / 29.8 742 / 30.2 


a = 0.2 


564 / 21.6 1182 / 378.3 
399 / 22.7 249 / 17.1 


s = 12 


1200 / 36.2 1166 / 350.4 
704 / 26.5 413 / 29.1 


g = 8 


1200 / 36.2 1166 / 350.4 
704/26.5 413/29.1 


a = 0.4 


965 / 22.7 1378 / 340.3 
656 / 19.5 595 / 27.4 


s = 16 


1641 / 43.9 1093 / 338.6 
1100 / 32.2 551 / 35.0 


q = l2 


1030/41.8 840/447.7 
662 / 26.4 4 / 29.8 



TABLE I 

Simulated signal results. In every table, each 2x2 cell contains the MSE (xlO"') and Hamming distance 
(MSE/Hamming) for Lasso (top.left), GLasso (top,right), HiLasso (bottom, left) and C-HiLasso (bottom, right). In the 

FIRST case (left) WE VARY THE NOISE cr WHILE KEEPING (J = 8 AND S = 8 FIXED. IN THE SECOND AND THIRD CASES WE HAVE (7 = 0. 

For the second experiment (center) we hxed g = 8 while changing s. In the third case we fix s = 12 and vary the 

NUMBER OF GROUPS q. BOLD BLUE INDICATES THE BEST RESULTS, ALWAYS OBTAINED FOR THE PROPOSED MODELS. IN ALL CASES, THE 

NUMBER OF ACTIVE GROUPS IS fc = 2. 



experiment 


Lasso 


GLasso 


HiLasso 


C-GLasso 


C-HiLasso 




AMSE 


Hamm 


AMSE 


Hamm 


AMSE 


Hamm 


AMSE 


Hamm 


AMSE 


Hamm 


1 digit 


0.06 


0.43 


0.07 


0.78 


0.02 


0.19 


0.01 


0.02 


0.02 


0.06 


1 digit+n 


0.08 


1.31 


0.08 


0.87 


0.04 


0.48 


0.05 


0.25 


0.02 


0.01 


2 digit 


0.09 


1.46 


0.08 


1.86 


0.02 


1.18 


0.01 


0.74 


0.02 


0.90 


2 digit+n 


0.11 


2.21 


0.08 


1.99 


0.04 


1.46 


0.09 


1.60 


0.03 


0.70 



TABLE II 



Noisy digit mixtures results. Four different cases are shown: when each signal is a single digit and when it is the 
mixture of two different (randomly selected) digits, with and without additive gaussian noise with standard 
deviation 10% of the peak value. for the 2 digits case, results are the average of 8 runs (in each round a new pair of 
digits was randomly selected). in the single digit case, the result is the average of the ten possible situations. both 
AMSE AND Hamming distance are shown, with bold blue indicating best. Without noise, both C-GLasso and C-HiLasso 

YIELD VERY GOOD RESULTS. HOWEVER, IN THE NOISY CASE, C-HILASSO IS CLEARLY SUPERIOR, SHOWING THE ADVANTAGE OF ADDING 
REGULARIZATION inside the groups from A ROBUSTNESS PERSPECTIVE. SEE ALSO FIGURe|4] 

is the recovered one. 

Using the usual training-testing spht for USPS, we first learned a dictionary for each digit. We then created a 
single dictionary by concatenating them. In Table |ll] we show the AMSE obtained while summing k — 2 different 
digits. We also consider the situation were only one digit is present. C-HiLasso automatically detects the number 
of sources while achieving the best recovery performance. As in the synthetic case, only the collaborative method 
was able to successfully detect the true active classes. In Figure |4] we relax the assumption that all the signals have 
to contain exactly the same type and amount of classes in the mixture, further demonstrating the flexibility of the 
proposed C-HiLasso model. 

We also used the digits dataset to experiment with missing data. We randomly discarded an average of 60% 
of the pixels per mixed image and then applied C-Hilasso. The algorithm is capable of correctly detecting which 
digits are present in the images. Some example results for this case are shown in Figure |5] Note that this is a 
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Fig. 4. In this example we used C-HiLasso to analyze mixtures where the data set contains different number and types of sources/classes. We 
used a set containing 180 mixtures of digit images. The first 150 images are obtained as the sum/mixture of a number "3" and an number "5" 
(randomly selected). Each of the last 30 images in the set are the mixture of three numbers: "3" ,"5" and "7" (the 180 images are of course 
presented at random, the algorithm is not a priori aware which images contain 2 sources and which contain 3). The figure shows the active sets 
of the recovered coefficients matrix A as a binary matrix the same size as A (atom indices in the vertical and sample indices in the horizontal), 
where black dots indicate nonzero coefficients. C-HiLasso managed to identify the active blocks while the sub-dictionary corresponding to "7" 
is mostly active for the last 30 images. The accuracy of this result depends on the relationship between the sub-dictionaries corresponding to 
each digit. 



quite different problem than the one commonly addressed in the matrix completion literature. Here we do not 
aim to recover signals that all belong to a unique unknown subspace, but signals that are the combination of two 
non-unique spaces to be automatically identified from the available dictionary. Such unknown spaces have common 
models/groups for all the signals in question (the coarse level of the hierarchy), but not necessarily the exact same 
atoms inside the groups and therefore do not necessarily belong to the same subspaces. Both levels of the hierarchy 
are automatically detected, e.g., the groups corresponding to "3" and "5," and the corresponding reconstructing 
atoms (subspaces) in each group, these last ones possibly different for each signal in the set. While we consider 
that the possible subspaces are to be selected from the provided dictionary (learned off-line from training data), 
in Section |VT] we discuss learning such dictionaries as part of the optimization as well (see also 1371 . (Ml). In 
such cases, the standard matrix completion problem becomes a particular case of the C-HiLasso framework (with 
a single group and all the signals having the same active set, subspace, in the group), naturally opening numerous 
theoretical questions for this new more general model Q 

We also compared the performance of C-HiLasso, Lasso, GLasso and C-GLasso (without hierarchy) in the task 
of separating mixed textures in an image. In this case, the set of signals X corresponds to all 12x12 patches in 
the (single) image to be analyzed. We chose 8 textures from the Brodatz dataset and trained one dictionary for 
each one of them using one half of the respective images (these form the g — 8 groups of the dictionary). Then 
we created an image as the sum of the other halves of the k — 2 textures. One can think of this experiment as a 

'Prof. Carin and collaborators have new results on the case of a single group and signals in possible different subspaces of the group, an 
intermediate model between standard matrix completion and C-HiLasso (personal communication). 
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Fig. 5. Example of recovered digits (3 and 5) from a mixture with 60% of missing components. From left to right: noiseless mixture, observed 
mixture with missing pixels highlighted in red, recovered digits 3 and 5, and active set recovered for all samples using the C-HiLasso and Lasso 
respectively. In the last two figures, the active sets are represented as in Figure [4] The coefficients blocks for digits 3 and 5 are marked as pink 
bands. Notice that the C-HiLasso exploits efficiently the hypothesis of collaborative group-sparsity, succeeding in recovering the connect active 
groups in all the samples. The Lasso, which lacks this prior knowledge, is clearly not capable of doing so, and active sets are spread over all 
the groups. 




Fig. 6. Texture separation results. Left to right: sample mixture, corresponding C-HiLasso separated textures, and comparison of the active 
set diagrams obtained by the Lasso (as in Figure |5}. The one for Lasso is shown on top, where all groups are wrongly active, and the one for 
C-HiLasso on bottom, showing that only the two correct groups are selected. 



generalization to the texture separation problem proposed in 11391 (without additive noise), where only two textures 
are present. The experiment was repeated for all possible combinations of two textures from the 8 possible ones. 
The results are summarized in Table III A detailed example is shown in Figure [6] For each algorithm, the best 
parameters were chosen using grid search, ensuring that those were not in the edges of the grid. For Lasso and 
C-HiLasso the best Ai is 0.0625. For GLasso and C-GLasso, the best A2 was, respectively, 0.05 and 75 (for the 
collaborative setting, we heuristically scale A2 with the number of signals as ^Jn. In this experiment, n w 512^, 



leading to such large value of A2). From Table III we can conclude that the C-HiLasso is significantly better than 
the competing algorithms, both in the MSB of the recovered signals (we show the AMSE of recovering both active 
signals), and in the average Hamming distance between the recovered group-wise active sets and the true ones. In 
the latter case we observe that, in many cases, the C-HiLasso active set recovery performance is perfect (Hamming 
distance 0) or near perfect, whereas the other methods seldom approach a Hamming distance lower than 1. 

Finally, we use C-HiLasso to automatically identify the sources present in a mixture of audio signals |40|. The 
goal is to identify the speakers talking simultaneously on a single recording. Here the task is not to fully reconstruct 
each of the unmixed sources from the observed signal but to identify which speakers are active. In this case, since 
the original sources do not need to be recovered, the modeling can be done in terms of features extracted from the 
original signals in a linear but non-bijective way. 
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2.80 0. 
1.36 0. 



0.33 0. 
2.06 0. 



0.96 0. 
1.97 0. 



1.02 1. 
2.25 0. 



2.26 0. 
2.50 0. 



4.37 1. 
2.51 0. 



0.09 0. 
0.53 0. 



,00 



110 
117 



214 
69 



3.65 
2.67 



0.00 

0.02 



3.69 
2.30 



0.07 
0.00 



3.55 
2.52 



1.00 
0.94 



4.12 
3.23 



0.53 

0.82 



4.47 
2.39 



0.08 

0.22 



3.77 
1.75 



1.00 
0.01 



18 

069 



074 63 
18 126 



78 
38 



107 
182 



76 141 
68 209 



129 
102 



52 
158 



42 

43 



1.74 
2.42 



0.00 
0.00 



1.42 
3.39 



1.00 2.25 
0.16 2.85 



LOO 
0.35 



3.48 
3.54 



0.44 3.49 
0.20 3.11 



0.32 
0.01 



4.09 
2.42 



0.13 4.23 
0.02 2.76 



0.12 
0.02 



0.31 
2.04 



1.00 1.83 
0.00 1.82 



1.00 
0.00 



19 
47 



47 
18 



91 
100 



83 
78 



35 
83 



62 
29 



49 

81 



72 
55 



3.16 
4.07 



1.00 
0.40 



4.20 
2.24 



1.00 
0.20 



1.13 
2.18 



1.00 
0.00 



85 174 
132 51 



191 234 
257 141 



105 112 
214 62 



123 145 
224 98 



85 76 
120 59 



4.42 0.42 
2.96 0.11 



3.14 0.97 
3.04 0.24 



107 447 
102 42 



7 

27 



43 
3 



240 219 
245 178 



68 105 
95 19 



162 141 
200 107 



21 93 
102 10 



182 148 
214 107 



26 
85 



10 



120 87 
107 71 



15 
41 



63 
9 



229 240 
245 162 



56 95 
117 27 



100 112 
102 51 



4.30 1.00 
1.90 0.18 



TABLE III 

Texture separation results. The rows and columns indicate the active textures in each cell. The upper triangle 
contains the amse (x 10*) results, while the lower triangle shows the hamming error in the group-wise active set 
RECOVERY. Within each cell, results are shown for the Lasso (top left). Group Lasso (bottom left). Collaborative 
Group Lasso (top right) and Collaborative Hierarchical Lasso (bottom right). The best results are in blue bold. 
Note that, both for the AMSE and Hamming distance, in 26 out of 28 cases, our model outperforms previous ones. 



Audio signals have in general very rich structures and their properties rapidly change over time. A natural 
approach is to decompose them into a set of overlapping local time-windows, where the properties of the signal 
remain stable. There is a straightforward analogy with the approach explained above for the texture segmentation 
case, where images were decomposed into collections of overlapping patches. These time-windows will collaborate 
in the identification. 

A challenging aspect when identifying audio sources is to obtain features that are specific to each source and at 
the same time invariant to changes in the fundamental frequency (pitch) of the sources. In the case of speech, a 
common choice is to use the short-term power spectrum envelopes as feature vectors ||4T1 (refer to ||40l for details 
on the feature extraction process and implementation). The spectral envelope in human speech varies along time, 
producing different patterns for each phoneme. Thus, a speaker does not produce an unique spectral envelope, but 
a set of spectral envelopes that live in a union of manifolds. Since such manifolds are well represented by sparse 
models, the problem of speaker identification is well suited for the proposed C-HiLasso framework, where each 
block in the dictionary is trained for the features corresponding to a given speaker, and the overlapping time-windows 
collaborate in detecting the active blocks. 

For this experiment we use a dataset consisting of recordings of five different German radio speakers, two female 
and three male. Each recording is six minutes long. One quarter of the samples were used for dictionary training, 
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Fig. 7. Speaker identification results. Eacli column corresponds to the sources identified for a specific time frame, tlie true ones marked by 
yellow dots. The vertical axis indicates the estimated activity of the different sources, where darker colors indicate higher energy. For each 
possible combination of speakers, 10 frames (15 seconds of audio) were evaluated. 

and the rest for testing. For each speaker, we learned a sub-dictionary from the training dataset. For testing, we 
extracted 10 non-overlapping frames of 15 seconds each (including silences made by the speakers while talking), 
and encoded them using C-HiLasso. The experiment was repeated for all possible combinations of two speakers, 
and all the speakers talking alone. The results are presented in Figure |7] C-HiLasso manages to detect automatically 
the number of sources very accurately, as well as the actual active speakers. Again, refer to ||40l for comparisons 
with other sparse modeling methods (showing the clear advantage of C-HiLasso) and results obtained for the 
identification of wind instruments in musical recordings. 

VI. Discussion 

We introduced a new framework of collaborative hierarchical sparse coding, where multiple signals collaborate 
in their encoding, sharing code groups (models) and having (possible disjoint) sparse representations inside the 
corresponding groups. An efficient optimization approach was developed, which guarantees convergence to the 
global minimum, and examples illustrating the power of this framework were presented. At the practical level, 
we are currently continuing our work on the applications of this proposed framework in a number of directions, 
including collaborative instruments separation in music, signal classification, and speaker recognition, following the 
here demonstrated capability to collectively select the correct groups/models. 

At the theoretical level, a whole family of new problems is opened by this proposed framework, some of which 
we already addressed in this work. A critical one is the overall capability of selecting the correct groups in the 
collaborative scenario, with missing information, and thereby of performing correct model selection and source 
identification and separation. Results in this direction will be reported in the future. 

Finally, we have also developed an initial framework for learning the dictionary for collaborative hierarchical 
sparse coding, meaning the optimization is simultaneously on the dictionary and the code. As it is the case with 
standard dictionary learning, this is expected to lead to significant performance improvements (see ll36l for the 
particular case of this with a single group active at a time). 
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